Abstract. I show how any reversible Markov chain on a finite state space that is irreducible, and 
hence suitable for estimating expectations with respect to its invariant distribution, can be used to 
construct a non-reversible Markov chain on a related state space that can also be used to estimate 
these expectations, with asymptotic variance at least as small as that using the reversible chain 
(typically smaller). The non-reversible chain achieves this improvement by avoiding (to the extent 
possible) transitions that backtrack to the state from which the chain just came. The proof that this 
modification cannot increase the asymptotic variance of an MCMC estimator uses a new technique 
that can also be used to prove Peskun's (1973) theorem that modifying a reversible chain to reduce 
the probability of staying in the same state cannot increase asymptotic variance. A non-reversible 
chain that avoids backtracking will often take little or no more computation time per transition 
than the original reversible chain, and can sometimes produce a large reduction in asymptotic 
variance, though for other chains the improvement is slight. In addition to being of some practical 
interest, this construction demonstrates that non-reversible chains have a fundamental advantage 
over reversible chains for MCMC estimation. Research into better MCMC methods may therefore 
best be focused on non-reversible chains. 

1 Introduction 

Markov chain Monte Carlo (MCMC) is widely used to estimate expectations of functions with 
respect to complex, high-dimensional probability distributions, particularly in Bayesian statistics 
and statistical physics (see, for example, Liu 2001). An MCMC estimator can be based on any 
Markov chain that is irreducible and that has the distribution of interest as its invariant distribution. 
However, the choice of Markov chain will affect the efficiency with which estimates of expectations 
with a given accuracy can be obtained. In this paper, I show that an MCMC estimator based on 
a reversible Markov chain on a finite state space can be improved in terms of asymptotic variance 
(or in degenerate cases, not made worse) by transforming it to a Markov chain on a related space 
that will be non-reversible (except when the state space has only one or two states). 

The non-reversible chains produced by this construction avoid, when possible, transitions that 
backtrack by returning to the state from which the chain just came. This is done by expanding the 
state space to pairs of states of the original chain — representing, roughly speaking, the previous 
and current states — and then updating this pair using two operations in sequence, one a swap, and 
the other a modified Gibbs sampling update of the second component that tries to avoid leaving 
the state unchanged. Many such modifications are possible; one that is generally applicable was 
introduced by Liu (1996). Though both the swap and the modified Gibbs sampling update are 
reversible, their application in sequence is not reversible. 

Simulation of this non-reversible chain will often require little or no more computation time 
than simulation of the original reversible chain. The advantage of the non-reversible chain can be 
dramatic when the original chain is such that suppressing backtracking has the effect of forcing 
movement in the same direction for many steps, thereby suppressing the slow random-walk motion 
that reversible chains are subject to. In other cases, however, the improvement may be slight. Aside 
from possible practical applications of the particular construction I present, the results indicate that 
non-reversible chains are fundamentally superior to reversible chains for MCMC estimation, and 
hence research into improved MCMC methods may be best directed toward methods based on 
non-reversible chains. 

My proof that asymptotic variance will not increase as a result of modifying the chain to avoid 
backtracking uses a new technique based on dividing the chains into blocks delimited by transitions 
that are affected by the modification, and then showing that the only effect of the modification is 
to partially stratify the sampling for these blocks. This stratification can only decrease asymptotic 
variance, or leave it unchanged. As an introduction to this technique, I start by giving a new proof 
of Peskun's (1973) theorem that asymptotic variance will not be increased by modifying a reversible 
chain to reduce the probability of staying in the same state, while keeping the probability of other 
transitions at least as large as before. This proof gives some insight into why the hypothesis of 
reversibility is necessary for Peskun's theorem. This new proof technique holds promise for proving 
that other transformations of both reversible and non-reversible chains are also beneficial. 

2 Preliminaries 

Suppose we wish to estimate the expectation of some function, $f(x)$, with respect to a distribution 
with probabilities $\pi(x)$, where $x$ is in some finite space $\mathcal{X}$. (Generalizations to infinite spaces will 
not be dealt with in this paper.) The MCMC approach to this problem is to simulate a Markov 
chain $X_1, X_2, \ldots$ that has $\pi$ as an invariant distribution, that is, for which 

$$\pi(y) \;=\; \sum_{x \in \mathcal{X}} \pi(x)\, T(x,y), \quad \text{for all } y \in \mathcal{X} \tag{1}$$

where $T(x,y) = P(X_{t+1} = y \mid X_t = x)$ are the transition probabilities of the Markov chain (assumed 
to be the same for all $t$). If the Markov chain is also irreducible (a series of transitions with non-zero 
probability connects any two states), it will have only one invariant distribution, and the estimator 

$$\hat\mu_n \;=\; \frac{1}{n} \sum_{t=1}^{n} f(X_t) \tag{2}$$

will converge to $\mu = E_\pi[f(X)]$ as $n$ goes to infinity. Furthermore, a Central Limit Theorem 
applies, showing that the distribution of $\hat\mu_n$ is asymptotically normal (possibly a degenerate normal 
distribution with variance zero). These fundamental aspects of MCMC are discussed, for example, 
by Tierney (1994) and Liu (2001). Some statements of the results mentioned above in these 
references make a further assumption that the chain is aperiodic, but this is not essential (Hoel, 
Port, and Stone 1972, Chapter 2, Theorems 3, 5, and 7; Romanovsky 1970, Section 43). 

The asymptotic variance of the estimator (2) is defined to be 

$$V_\infty(\hat\mu) \;=\; \lim_{n \to \infty} n \,\mathrm{Var}(\hat\mu_n) \tag{3}$$

Note that this does not depend on the initial distribution for $X_1$. The bias of the estimator will 
be of order $1/n$, so its asymptotic mean squared error will be equal to its asymptotic variance. In 
practice, rather than $\hat\mu_n$ from (2), we would use an estimator based only on $X_t$ with $t$ greater than 
some time past which we believe the chain has reached a distribution close to $\pi$, but this refinement 
(which reduces bias) does not affect the asymptotic variance. 

Asymptotic variance can be used as a criterion for which of two Markov chains with the same 
invariant distribution is better, on the assumption that the squared error of a practical estimator 
based on $k$ consecutive states of the Markov chain will be approximately $V_\infty / k$. This is not 
guaranteed to be true. For example, if $\pi$ is uniform over $\mathcal{X}$, a Markov chain that deterministically cycles 
through all states in some order will have asymptotic variance of zero, but if the number of states in 
$\mathcal{X}$ is enormous, an estimator based on simulating a practical number of transitions of this Markov 
chain may have large squared error. Nevertheless, in many contexts, $V_\infty$ will be a good guide to 
practical utility, and it will be used in this paper as the criterion for comparing Markov chains. 
Asymptotic variance has previously been used to compare Markov chains by Peskun (1973) and by 
Mira and Geyer (2000), as well as by many others. 
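
For a small finite chain, the asymptotic variance can be computed exactly rather than estimated by simulation. The sketch below is an illustration only: it assumes the standard fundamental-matrix identity for finite irreducible chains, and the function names `stationary` and `asymptotic_variance` are hypothetical, not taken from the paper. It reproduces the deterministic-cycle example above (asymptotic variance zero) and compares it with independent sampling.

```python
import numpy as np

def stationary(T):
    """Invariant distribution of an irreducible transition matrix T (rows sum to one)."""
    n = T.shape[0]
    # Solve pi (I - T) = 0 together with sum(pi) = 1 as one linear system.
    A = np.vstack([(np.eye(n) - T).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def asymptotic_variance(T, f):
    """lim n * Var(mean of f over n steps) for a finite irreducible chain."""
    n = T.shape[0]
    pi = stationary(T)
    fc = f - pi @ f                                  # centre f at its expectation
    # Fundamental matrix Z = (I - T + 1 pi^T)^{-1}; it exists even for periodic chains.
    Z = np.linalg.inv(np.eye(n) - T + np.outer(np.ones(n), pi))
    return 2.0 * (pi * fc) @ (Z @ fc) - (pi * fc) @ fc

f = np.array([0.0, 1.0, 2.0, 3.0])
cycle = np.roll(np.eye(4), 1, axis=1)   # deterministic cycle 0 -> 1 -> 2 -> 3 -> 0
iid = np.full((4, 4), 0.25)             # independent draws from the uniform distribution

v_cycle = asymptotic_variance(cycle, f)  # zero, up to rounding
v_iid = asymptotic_variance(iid, f)      # equals the variance of f under the uniform distribution
print(v_cycle, v_iid)
```

With independent draws the result is just the variance of $f$ under $\pi$, which gives a convenient baseline against which to judge a chain.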

A Markov chain is said to be "reversible" if its transition probabilities satisfy the following 
"detailed balance" condition with respect to $\pi$: 

$$\pi(x)\, T(x,y) \;=\; \pi(y)\, T(y,x), \quad \text{for all } x, y \in \mathcal{X} \tag{4}$$

As a consequence, a sequence $X_1, \ldots, X_n$ from a reversible Markov chain with $X_1$ having initial 
distribution $\pi$ will have the same distribution as the reversed sequence of states, $X_n, \ldots, X_1$. 
Detailed balance implies that $\pi$ is an invariant distribution of the Markov chain, but the converse 
need not hold. 

Many MCMC methods use reversible Markov chains, notably the widely-used Metropolis-Hastings 
algorithm (Hastings 1970). In this algorithm, a transition from state $x$ is performed by 
first randomly drawing a state, $x^*$, from some "proposal distribution", with probabilities $S(x,x^*)$, 
and then accepting $x^*$ as the next state of the chain with probability 

$$a(x,x^*) \;=\; \min\left[\,1,\ \frac{\pi(x^*)\, S(x^*,x)}{\pi(x)\, S(x,x^*)}\,\right] \tag{5}$$

If $x^*$ is not accepted, the next state of the chain is the same as the current state. The result is 
that for $y \ne x$, the transition probability is $T(x,y) = S(x,y)\,a(x,y)$. One can easily show that 
these transitions satisfy detailed balance with respect to $\pi$, and hence leave $\pi$ invariant. As a 
special case of the Metropolis-Hastings algorithm, when $x$ consists of several components, $x^*$ might 
differ from $x$ in only a single component, with the value for that component in $x^*$ being drawn 
from its conditional distribution under $\pi$ given the values of the other components. The acceptance 
probability of (5) will then always be one. This is called a "Gibbs sampling" (or "heatbath") update 
of the component. 
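
The relation $T(x,y) = S(x,y)\,a(x,y)$ can be checked numerically on a small state space. The following is a minimal sketch under stated assumptions: the target `pi`, the uniform proposal matrix `S`, and the helper names are all hypothetical choices made for illustration, and the acceptance probability is the standard Metropolis-Hastings ratio.

```python
import numpy as np

def metropolis_hastings_matrix(pi, S):
    """Transition matrix of a Metropolis-Hastings chain with target pi and proposal matrix S."""
    n = len(pi)
    T = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x and S[x, y] > 0:
                accept = min(1.0, pi[y] * S[y, x] / (pi[x] * S[x, y]))
                T[x, y] = S[x, y] * accept
        # Rejected proposals (and proposals of x itself) leave the chain at x.
        T[x, x] = 1.0 - T[x].sum()
    return T

def satisfies_detailed_balance(pi, T):
    """Check pi(x) T(x,y) == pi(y) T(y,x) for all x, y."""
    flow = pi[:, None] * T
    return np.allclose(flow, flow.T)

pi = np.array([0.1, 0.2, 0.3, 0.4])     # illustrative target distribution
S = np.full((4, 4), 0.25)               # illustrative proposal: uniform over all states
T = metropolis_hastings_matrix(pi, S)

print(satisfies_detailed_balance(pi, T))   # detailed balance holds
print(np.allclose(pi @ T, pi))             # and therefore pi is invariant
```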

Not all Markov chains used for MCMC are reversible, however. In particular, non-reversible 
Markov chains often arise as a result of applying two or more reversible transitions in sequence. If 
$T_1$ and $T_2$ are matrices of transition probabilities that satisfy detailed balance with respect to $\pi$ 
(and hence leave $\pi$ invariant), their product, $T_1 T_2$, will also leave $\pi$ invariant, but will typically not 
satisfy detailed balance. A common example of this is when Gibbs sampling updates are applied 
to each component of state in some deterministic sequence. 

There is no reason to avoid non-reversible chains in practical applications of MCMC: what 
is essential is that the chain leave $\pi$ invariant, not that it be reversible with respect to $\pi$. The 
non-reversibility of deterministic-scan Gibbs sampling is thought to be of little significance, but 
other non-reversible MCMC methods are designed to exploit non-reversibility to avoid the slow, 
diffusive movement via a random walk that is typical of reversible Markov chains. Examples include 
"overrelaxation" methods (Adler 1981; Neal 1998, 2003) and the "guided Monte Carlo" methods 
of Horowitz (1991) and Gustafson (1998). 

However, non-reversible Markov chains have often been avoided in theoretical discussions, since 
they are harder to analyse than reversible chains. In contrast, Diaconis, Holmes, and Neal (2000) 
analysed a particular non-reversible chain and showed that it converges to its invariant distribution 
much faster than a related reversible chain. Mira and Geyer (2000) explored whether non-reversible 
chains can be transformed to reversible chains with the same asymptotic variance, and found a 
method that sometimes does this, but not always, again showing that non-reversible chains might 
be superior to reversible chains. These results, and the practical usefulness of some non-reversible 
chains, lead one to ask whether any reversible chain can be transformed to a non-reversible chain 
that is better. With some caveats — notably, a restriction to finite state spaces — this paper 
provides an affirmative answer to this question. 

3 Peskun's theorem on modifying a reversible chain to avoid 
staying in the same state 

Before showing how to construct a non-reversible chain that is better than a given reversible chain, 
I will present Peskun's theorem, which shows that modifying a reversible chain to decrease the 
probability of staying in the same state, while keeping the probabilities of other transitions at 
least as large, cannot increase asymptotic variance. This theorem is relevant to the non-reversible 
construction that follows. I also introduce the new proof techniques I use by proving this theorem 
in Section 5. 

Theorem 1 (Peskun 1973): Let $X_1, X_2, \ldots$ and $X'_1, X'_2, \ldots$ be two irreducible Markov chains on 
the finite state space $\mathcal{X}$, both of which are reversible with respect to the distribution $\pi$, and hence 
have $\pi$ as their unique invariant distribution. Let the transition probabilities for these chains be 

$$T(x,y) = P(X_{t+1} = y \mid X_t = x), \qquad T'(x,y) = P(X'_{t+1} = y \mid X'_t = x) \tag{6}$$

Let $f(x)$ be some function of state, whose expectation with respect to $\pi$ is $\mu$. Consider the following 
two estimators for $\mu$ based on these two chains: 

$$\hat\mu_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t), \qquad \hat\mu'_n = \frac{1}{n} \sum_{t=1}^{n} f(X'_t) \tag{7}$$

If $T$ and $T'$ satisfy the following condition, 

$$T'(x,y) \ge T(x,y), \quad \text{for all } x, y \in \mathcal{X} \text{ with } x \ne y \tag{8}$$

then the asymptotic variance of $\hat\mu'$ will be no greater than that of $\hat\mu$. 

Since it seems inefficient to stay in the same place, Peskun's theorem might seem obvious. 
Two facts show that the situation is more subtle than this. First, only the asymptotic variance is 
guaranteed not to increase if off-diagonal entries in the transition matrix are increased. The variance 
of an estimator based on a finite number of transitions, started from $\pi$, may increase (Tierney, 1998). 
Second, Peskun's theorem does not hold if the condition that the chains be reversible is omitted. 
Here is a counterexample using a non-reversible chain with four states: 



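[Figure: transition diagram for the four-state counterexample, with some transition probabilities of 1/2; $\pi(x)$ is 1/3 or 1/6, depending on the value of $f(x)$.] 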


Values of $f(x)$ are shown in the circles. Solid arrows show values of both $T(x,y)$ and $T'(x,y)$; 
dotted arrows are for $T(x,y)$ only; dashed arrows are for $T'(x,y)$ only. The asymptotic variance 
is zero when using $T$: the chain proceeds clockwise through the four states, never backtracking 
(though sometimes staying put in a state where $f(x) = \mu$), with the result that $|\hat\mu_n - \mu| < 1/n$. The 
modification that produces $T'$ disturbs this cyclic behaviour, with the result that the asymptotic 
variance becomes greater than zero. 

One application of Peskun's theorem is to motivate a modified form of Gibbs sampling due 
to Liu (1996), which tries to avoid setting a component to the same value it had previously. 
Suppose, for example, that the state consists of two components, and the current state is $(x,y)$. As 
mentioned above, a Gibbs sampling update for $y$ can be seen as a Metropolis-Hastings update with 
a proposal distribution that keeps the first component unchanged and draws a new value for the 
second component, $y^*$, from its conditional distribution, $\pi(y|x)$. In Liu's modification, the proposal 
distribution is confined to values for the second component other than the current value, $y$, with 
the probability for proposing $y^*$ being $\pi(y^*|x) \,/\, (1 - \pi(y|x))$. The acceptance probability (from (5)) 
then becomes 

$$a((x,y),(x,y^*)) \;=\; \min\left[\,1,\ \frac{\pi(x,y^*)\,\pi(y|x) \,/\, (1-\pi(y^*|x))}{\pi(x,y)\,\pi(y^*|x) \,/\, (1-\pi(y|x))}\,\right] \;=\; \min\left[\,1,\ \frac{1-\pi(y|x)}{1-\pi(y^*|x)}\,\right] \tag{9}$$

In the special case that $\pi(y|x) = 1$, we never accept the proposal (which is undefined). 
One can easily verify that this modification increases the probability of a transition to all values 
except for the current value. However, Peskun's theorem will not apply if these modified Gibbs 
sampling updates are applied in sequence (producing a non-reversible chain). Peskun's theorem 
does apply if we select a component to update at random, although this is not how Gibbs sampling is 
commonly done in practice. Liu's modification of Gibbs sampling will play a role in the construction 
of a non-reversible chain that avoids backtracking, which is presented next. 
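
The claim that Liu's modification raises every off-diagonal probability while keeping detailed balance can be checked directly for a single conditional distribution. The sketch below is illustrative only: `liu_modified_update` is a hypothetical helper name, `p` stands for the conditional distribution $\pi(\cdot|x)$, and the code assumes every entry of `p` is strictly less than one.

```python
import numpy as np

def liu_modified_update(p):
    """U[y, z]: probability of moving the component from value y to value z under
    Liu's modified Gibbs update, for a conditional distribution p with all p[y] < 1."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    U = np.zeros((n, n))
    for y in range(n):
        for z in range(n):
            if z != y:
                proposal = p[z] / (1.0 - p[y])                   # propose only values other than y
                accept = min(1.0, (1.0 - p[y]) / (1.0 - p[z]))   # acceptance probability as in (9)
                U[y, z] = proposal * accept
        U[y, y] = 1.0 - U[y].sum()                               # leftover mass stays at y
    return U

p = np.array([0.5, 0.3, 0.2])          # illustrative conditional distribution
U = liu_modified_update(p)
gibbs = np.tile(p, (3, 1))             # ordinary Gibbs update: new value drawn from p regardless of y

off_diagonal = ~np.eye(3, dtype=bool)
print(np.all(U[off_diagonal] >= gibbs[off_diagonal]))   # moves to other values are at least as likely
print(np.diag(U), "versus", p)                          # staying put is less likely
```

Because the off-diagonal entries dominate those of the ordinary Gibbs update and detailed balance is retained, the pair (`gibbs`, `U`) satisfies the hypotheses of Theorem 1 when used as a single reversible update.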

4 Constructing a non-reversible chain from a reversible chain 
so as to avoid backtracking 

As above, suppose we have an irreducible Markov chain on a finite state space, $\mathcal{X}$, with 
transition probabilities given by $T(x,y)$. Suppose also that this chain is reversible, so these transition 
probabilities satisfy the detailed balance condition (4) with respect to some invariant distribution, 
$\pi$. In this section, I show how to construct from $T$ a non-reversible Markov chain that avoids 
backtracking. The state space for this chain will be $\tilde{\mathcal{X}} = \{(x,y) : T(x,y) > 0\}$, and it will leave 
invariant the distribution $\tilde\pi$ with probabilities defined as follows: 

$$\tilde\pi(x,y) \;=\; \pi(x)\,T(x,y) \;=\; \pi(y)\,T(y,x) \tag{10}$$

One can view $\tilde\pi$ as the distribution for a pair of consecutive states from the original chain, with the 
first state in the pair drawn from $\pi$. The second formula above follows from the reversibility of the 
original chain. Note that under $\tilde\pi$, the marginal distributions of the first and second components 
are both $\pi$. We can therefore estimate the expectation with respect to $\pi$ of any function defined 
on $\mathcal{X}$ by averaging the values of either component in pairs distributed according to $\tilde\pi$. 

I will first show how to construct a chain with state space $\tilde{\mathcal{X}}$ and invariant distribution $\tilde\pi$ that is 
essentially the original chain in disguise, but which can later be modified to prevent backtracking. 
This construction is called "expanding" the chain by Kemeny and Snell (1960, Section 6.5). A 
transition of this chain consists of the following two operations, applied in sequence: 

1) Swap the two components of the state. 

2) Replace the second component of this swapped state with a new value sampled from its 
conditional distribution (under $\tilde\pi$) given the current value of the first component. 

From (10), the conditional probability for the second component to be $y$, given that the first 
component is $x$, is $T(x,y)$. The transition probabilities, $\tilde T$, for this chain can therefore be written 
as follows: 

$$\tilde T((x_0,y_0),\,(x_1,y_1)) \;=\; \delta(x_1,y_0)\, T(x_1,y_1) \tag{11}$$

where $\delta(x,y)$ is one if $x = y$ and zero otherwise. 

The first operation above leaves $\tilde\pi$ invariant, since $\tilde\pi(x,y) = \tilde\pi(y,x)$, due to the reversibility of 
the original chain. The second operation above leaves $\tilde\pi$ invariant as well, since it is simply a Gibbs 
sampling update of the second component. Applying these two operations in sequence therefore 
also leaves $\tilde\pi$ invariant. 

Although both operations above are reversible, applying them in sequence produces a 
non-reversible chain (except in degenerate situations). This non-reversibility is of no consequence, 
however, since the above chain on $\tilde{\mathcal{X}}$ essentially replicates the operation of the original chain on $\mathcal{X}$. 
Starting from state $(x_0,x_1)$, the chain will proceed to states $(x_1,x_2)$, $(x_2,x_3)$, $(x_3,x_4)$, etc., with 
each $x_t$ being drawn according to the probabilities $T(x_{t-1},x_t)$, just as in the original chain. If we 
estimate the expectation of $f$ with respect to $\pi$ by the average value of $f$ applied to the second 
components of these states, the result will be exactly the same as an estimate based on states of 
the original chain. 
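
The expanded chain of (11) is easy to build explicitly for a small example. The sketch below is illustrative only: the three-state random walk, the function names, and the use of the fundamental-matrix formula for asymptotic variance are assumptions made here, not part of the paper. It checks that the pair distribution of (10) is invariant, that the expanded chain is not reversible, and that averaging $f$ over second components gives the same asymptotic variance as the original chain.

```python
import numpy as np

def asymptotic_variance(T, pi, f):
    """Asymptotic variance of the sample mean of f for an irreducible chain T with invariant pi."""
    fc = f - pi @ f
    Z = np.linalg.inv(np.eye(len(pi)) - T + np.outer(np.ones(len(pi)), pi))
    return 2.0 * (pi * fc) @ (Z @ fc) - (pi * fc) @ fc

def expand_chain(T, pi):
    """Expanded chain on pairs (x, y) with T[x, y] > 0: swap, then redraw the second component."""
    n = len(pi)
    pairs = [(x, y) for x in range(n) for y in range(n) if T[x, y] > 0]
    index = {p: i for i, p in enumerate(pairs)}
    T_exp = np.zeros((len(pairs), len(pairs)))
    for (x0, y0), i in index.items():
        for y1 in range(n):
            if T[y0, y1] > 0:
                T_exp[i, index[(y0, y1)]] = T[y0, y1]   # new first component must equal y0
    pi_exp = np.array([pi[x] * T[x, y] for x, y in pairs])
    return pairs, T_exp, pi_exp

# Illustrative reversible chain: random walk on three states that holds at the ends.
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
pi = np.full(3, 1.0 / 3.0)
f = np.array([0.0, 1.0, 2.0])

pairs, T_exp, pi_exp = expand_chain(T, pi)
f_second = np.array([f[y] for _, y in pairs])    # f applied to the second component

v_original = asymptotic_variance(T, pi, f)
v_expanded = asymptotic_variance(T_exp, pi_exp, f_second)
flow = pi_exp[:, None] * T_exp
print(np.allclose(pi_exp @ T_exp, pi_exp))       # the pair distribution is invariant
print(np.allclose(flow, flow.T))                 # False: the expanded chain is not reversible
print(v_original, v_expanded)                    # the two asymptotic variances agree
```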

To obtain a more interesting non-reversible chain, we can change the second operation above to 
use the modified Gibbs sampling update of Liu (1996), discussed above in Section 3. More generally, 
we might modify the second operation in any way that reduces the probability of staying in the same 
state, while keeping the probabilities of transitions to other states at least as large as before, and 
maintaining reversibility with respect to $\tilde\pi$. Such a modified chain, whose transition probabilities we 
will write as $\tilde T'$, will differ substantively from the original reversible chain, since it will reduce the 
probability of "backtracking" to the state preceding the current state. For example, starting from 
state $(x_0,x_1)$, a chain with transition probabilities $\tilde T$, equivalent to the original reversible chain, 
might proceed to state $(x_1,x_2)$ and then to $(x_2,x_1)$, corresponding to the original reversible 
chain moving from $x_1$ to $x_2$ and then back to $x_1$. A modified chain with transitions $\tilde T'$ might also 
proceed from $(x_0,x_1)$ to $(x_1,x_2)$, but after the swap operation of the next transition, the state 
$(x_2,x_1)$ would be updated by a modified Gibbs sampling operation that has a reduced probability 
of leaving the second component equal to $x_1$. 

That this avoidance of backtracking cannot increase asymptotic variance is the central result of 
this paper. This is stated in the theorem below, which is proved in a later section. 

Theorem 2: Let $X_1, X_2, \ldots$ be an irreducible Markov chain on the finite state space $\mathcal{X}$ having 
transition probabilities $T(x,y) = P(X_{t+1} = y \mid X_t = x)$ that satisfy detailed balance with respect to 
the distribution with probabilities $\pi(x)$. Define a Markov chain $(X'_1,Y'_1), (X'_2,Y'_2), \ldots$ on the state 
space $\tilde{\mathcal{X}} = \{(x,y) : T(x,y) > 0\}$ with transition probabilities 

$$\tilde T'((x_0,x_1),\,(y_0,y_1)) \;=\; \delta(x_1,y_0)\, U'_{x_1}(x_0,y_1) \tag{12}$$

where $\delta(x,y)$ is one if $x = y$ and zero otherwise, and $U'_x(y,z)$ defines a set of probabilities for $z \in \mathcal{X}$ 
for any values of $x, y \in \mathcal{X}$, satisfying the following two conditions for all $x, y, z \in \mathcal{X}$ with $y \ne z$: 

$$T(x,y)\, U'_x(y,z) \;=\; T(x,z)\, U'_x(z,y) \tag{13}$$
$$U'_x(y,z) \;\ge\; T(x,z) \tag{14}$$

Let $f(x)$ be some function of state, whose expectation with respect to $\pi$ is $\mu$. Define the following 
two estimators for $\mu$ based on these two chains: 

$$\hat\mu_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t), \qquad \hat\mu'_n = \frac{1}{n} \sum_{t=1}^{n} f(Y'_t) \tag{15}$$

Then the following properties of $\tilde T'$ and the estimators above hold: (a) the chain with transition 
probabilities $\tilde T'$ is irreducible; (b) the transition probabilities $\tilde T'$ leave invariant the distribution 
with probabilities $\tilde\pi(x,y) = \pi(x)\,T(x,y)$; (c) if $\mathcal{X}$ contains at least three elements, the chain with 
transition probabilities $\tilde T'$ is not reversible with respect to $\tilde\pi$; (d) the bias of the estimator $\hat\mu'_n$ is of 
order $1/n$; (e) the asymptotic variance of $\hat\mu'$ is no greater than the asymptotic variance of $\hat\mu$. 
The transition probabilities $\tilde T$ (from (11)), which essentially mimic $T$, can also be written as 
$\tilde T((x_0,x_1),\,(y_0,y_1)) = \delta(x_1,y_0)\, U_{x_1}(x_0,y_1)$, with $U_{x_1}(x_0,y_1) = T(x_1,y_1)$. The change from $T$ to $\tilde T'$ 
in this theorem can therefore also be seen as a change from $\tilde T$ to $\tilde T'$ or from $U$ to $U'$. 

The new probabilities $U'_x(y,z)$ are modified update probabilities for the second component of 
state, with $x$ being the first component of state, $y$ the current value of the second component, and 
$z$ a new value for the second component. These updates must satisfy detailed balance with respect 
to $\tilde\pi$. In the case of Liu's modification, we find using (9) that for all $x, y, z \in \mathcal{X}$ with $z \ne y$, 

$$U'_x(y,z) \;=\; \frac{T(x,z)}{1 - T(x,y)} \,\min\left[\,1,\ \frac{1 - T(x,y)}{1 - T(x,z)}\,\right] \;=\; \min\left[\,\frac{T(x,z)}{1 - T(x,y)},\ \frac{T(x,z)}{1 - T(x,z)}\,\right] \tag{16}$$

$U'_x(y,y)$ is determined from the above by the requirement that probabilities sum to one. Note that 
if $T(x,y) \ge 1/2$, this expression simplifies to $T(x,z) \,/\, (1 - T(x,z))$. 

As a first example of such a modified chain, consider a Markov chain on $\mathcal{X} = \{1, 2, \ldots, N\}$ with 
transition probabilities of $T(x,y) = 1/2$ when $y = x+1$ or $y = x-1$ or $x = y = 1$ or $x = y = N$, 
and $T(x,y) = 0$ otherwise. This chain is irreducible, has the uniform distribution as its invariant 
distribution, and is reversible. From (16), we can see that for given $(x,y) \in \tilde{\mathcal{X}}$, $U'_x(y,z) = 1$ for 
some $z$. The transitions, $\tilde T'$, of the modified chain are therefore deterministic. For $N = 5$, these 
transitions follow the arrows in the diagram below: 



(1,2) → (2,3) → (3,4) → (4,5) → (5,5) → (5,4) → (4,3) → (3,2) → (2,1) → (1,1) → (1,2) 



The periodic nature of the modified chain results in the asymptotic variance being zero for any 
function of state, whereas for the original chain, the asymptotic variance is of order $N^2$, due to its 
random walk behaviour. This example parallels the chain analysed by Diaconis, Holmes, and Neal 
(2000), with $c$ set to zero in the definition (their equation 4.1) of their chain. Note, however, that 
the more general scheme they describe (in their section 5.1) does not correspond to the result of 
modifying a random walk Metropolis algorithm to avoid backtracking in the way described here. 
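
This first example can be reproduced numerically. The sketch below is a rough illustration under stated assumptions: states are indexed from 0 rather than 1, the helper names are hypothetical, the update probabilities follow (16), and asymptotic variance is computed with the fundamental-matrix formula rather than by any method from the paper.

```python
import numpy as np

def asymptotic_variance(T, pi, f):
    """Asymptotic variance of the sample mean of f for an irreducible chain T with invariant pi."""
    fc = f - pi @ f
    Z = np.linalg.inv(np.eye(len(pi)) - T + np.outer(np.ones(len(pi)), pi))
    return 2.0 * (pi * fc) @ (Z @ fc) - (pi * fc) @ fc

def liu_U(T, x, y, z):
    """U'_x(y, z) of equation (16), for z != y with T[x, y] > 0 and T[x, z] > 0."""
    return min(T[x, z] / (1.0 - T[x, y]), T[x, z] / (1.0 - T[x, z]))

def no_backtrack_chain(T, pi):
    """Transition matrix on pairs (x0, x1) following (12), with U' given by (16)."""
    n = len(pi)
    pairs = [(x, y) for x in range(n) for y in range(n) if T[x, y] > 0]
    index = {p: i for i, p in enumerate(pairs)}
    Tm = np.zeros((len(pairs), len(pairs)))
    for (x0, x1), i in index.items():
        # After the swap the state is (x1, x0); the second component x0 is then updated.
        for y1 in range(n):
            if y1 != x0 and T[x1, y1] > 0:
                Tm[i, index[(x1, y1)]] = liu_U(T, x1, x0, y1)
        Tm[i, index[(x1, x0)]] = max(0.0, 1.0 - Tm[i].sum())   # leftover mass: backtrack
    pi_pairs = np.array([pi[x] * T[x, y] for x, y in pairs])
    return pairs, Tm, pi_pairs

N = 5
T = np.zeros((N, N))
for x in range(N):
    T[x, max(x - 1, 0)] += 0.5          # step down, or hold at the lower end
    T[x, min(x + 1, N - 1)] += 0.5      # step up, or hold at the upper end
pi = np.full(N, 1.0 / N)
f = np.arange(N, dtype=float)

pairs, Tm, pi_pairs = no_backtrack_chain(T, pi)
f_second = np.array([f[y] for _, y in pairs])

v_original = asymptotic_variance(T, pi, f)
v_modified = asymptotic_variance(Tm, pi_pairs, f_second)
print(v_original, v_modified)           # the modified chain's value is zero up to rounding
```

Changing `N` gives a quick way to see the growth of the original chain's asymptotic variance while the modified chain's stays at zero.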

As another illustration, consider a chain on $\mathcal{X} = \{1, 2, \ldots, N\} \times \{1, 2, \ldots, M\}$, which may be 
visualized as dots arranged in an $N$ by $M$ rectangle, in which transitions go up, down, left, or 
right, with equal probabilities, except that if such a movement would leave the rectangle, the chain 
instead stays in the current state. This chain leaves the uniform distribution invariant. These 
transitions are shown below, for $N = 6$ and $M = 3$ (the unmarked transition probabilities are 1/4): 

[Figure: the transitions of this chain on the 6 by 3 grid, with the self-transitions at the four corners marked with probability 1/2.] 
The state space, $\tilde{\mathcal{X}}$, of the modified chain consists of the arrows in the diagram above. The 
transition probabilities, $\tilde T'$, for the modified chain, based on the modified updates of (16), are 
illustrated below, for two possible current states: 

[Figure: two possible current states of the modified chain and their successor states, some labeled with probability 1/3.] 

The current states in these diagrams are shown by solid arrows, and possible successor states by 
dotted arrows, labeled with their probabilities. 

The diagrams below show two paths within the rectangle, produced using the original chain (on 
the left) and the modified chain (on the right): 

[Figure: two sample paths within the rectangle, one from the original chain (left) and one from the modified chain (right).] 

Note that in two places the original chain backtracks to the preceding state. The new chain never 
backtracks in this way, but it is still possible for it to revisit states that were visited two or more 
time steps earlier. As a result, the improvement in asymptotic variance is not as dramatic as for 
the previous example. Asymptotic variance is improved only by a constant factor, which does not 
increase with $N$ and $M$. 

To simulate a chain that has been modified to avoid backtracking, with transition probabilities 
$\tilde T'$, we need to be able to draw a value from $\mathcal{X}$ according to the probabilities $U'_x(y,\cdot)$. If we use 
$U'_x(y,z)$ defined by (16), and if $T(x,z)$ is non-zero for only a small, known set of $z$ values, we can 
do this by explicitly computing the probabilities using (16). This will often be about as efficient as 
simulating the original chain. 

When $T(x,z)$ is non-zero for many values of $z$, the following procedure for drawing a $z$ value 
from $U'_x(y,\cdot)$ as defined by (16) may be useful. If $T(x,y) \ge 1/2$, draw a value $z^*$ according to the 
probabilities $T(x,z^*)$. If $z^* = y$, let $z = y$. Otherwise, accept $z^*$ as the value $z$ with probability 
$1 / (1 - T(x,z^*))$. If $z^*$ is not accepted, let $z$ equal $y$. If instead $T(x,y) < 1/2$, repeatedly draw 
$z^*$ according to the probabilities $T(x,z^*)$, until a $z^*$ not equal to $y$ is obtained (which won't take 
long). Accept this $z^*$ as the value $z$ with probability $\min[\,1,\ (1 - T(x,y)) \,/\, (1 - T(x,z^*))\,]$. If $z^*$ is not 
accepted, let $z$ equal $y$. 
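
The repeated-draw branch of this procedure (the case $T(x,y) < 1/2$) can be sketched as follows and compared against the explicit probabilities of (16). This is an illustration under assumptions: only that one branch is implemented, `T_row` stands for the vector $T(x,\cdot)$, and the function names, the example probabilities, and the seed are hypothetical.

```python
import numpy as np

def explicit_U(T_row, y):
    """Probabilities U'_x(y, .) computed directly from equation (16); T_row is T(x, .)."""
    U = np.zeros(len(T_row))
    for z in range(len(T_row)):
        if z != y and T_row[z] > 0:
            U[z] = min(T_row[z] / (1.0 - T_row[y]), T_row[z] / (1.0 - T_row[z]))
    U[y] = 1.0 - U.sum()
    return U

def draw_by_repeated_proposals(T_row, y, rng):
    """Draw z from U'_x(y, .) without forming it, intended for T_row[y] < 1/2."""
    while True:
        z_star = rng.choice(len(T_row), p=T_row)   # draw from T(x, .)
        if z_star != y:                            # redraw until the proposal differs from y
            break
    accept = min(1.0, (1.0 - T_row[y]) / (1.0 - T_row[z_star]))
    return int(z_star) if rng.random() < accept else y

rng = np.random.default_rng(0)
T_row = np.array([0.2, 0.3, 0.4, 0.1])             # illustrative values of T(x, .)
y = 0                                              # current second component
n_draws = 20000
draws = [draw_by_repeated_proposals(T_row, y, rng) for _ in range(n_draws)]
empirical = np.bincount(draws, minlength=len(T_row)) / n_draws

print(explicit_U(T_row, y))
print(empirical)                                   # should be close to the explicit probabilities
```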

In some cases, neither of the two procedures described above for simulating from $U'_x(y,\cdot)$ may 
be easy to implement efficiently. For instance, $T(x,y)$ for a Metropolis-Hastings transition may 
be hard to compute when $y = x$, since this requires summing the probabilities of rejection for 
all possible proposals. In any case, a rigorous demonstration that a modified chain that avoids 
backtracking can be simulated as quickly as the original chain is too much to expect, because it 
might be possible to simulate the original chain especially easily using some special trick that is 
not applicable to the modified chain. 

Nevertheless, I think it is fair to say that avoiding backtracking, either using Liu's modified Gibbs 
sampling update or some other form for $U'_x(y,z)$, is not the sort of modification that inherently 
involves a large increase in computation time per transition. That this modification decreases 
asymptotic variance (or in degenerate cases, does not increase it) is therefore an important indication that 
non-reversible chains have an advantage over reversible chains. 

5 A new proof of Peskun's theorem 

As an introduction to the techniques that will be used to prove that the no-backtracking 
construction of the previous section does not increase asymptotic variance, I will here use these techniques 
to prove Peskun's theorem, stated as Theorem 1 in Section 3. 

In this proof, the "old chain" will refer to the original chain with transition probabilities T, and 
the "new chain" will refer to the chain with transition probabilities T', which may be smaller than 
those of the old chain for self transitions, but are at least as large for transitions between distinct 
states. The proof that the estimator for the expectation of any function of state using the new 
chain has asymptotic variance at least as small as the corresponding estimator using the old chain 
will proceed as follows: 

1) We reduce the problem to comparing asymptotic variances when T and T' differ only for 
transitions involving two states, A and B. 

2) We can view simulations of the old and new chains as differing only for certain "delta" 
transitions involving states A and B. 

3) These delta transitions divide the Markov chain simulation into blocks of states, which start 
and end in either state A or state B. We can rewrite the old and new estimators, $\hat\mu$ and $\hat\mu'$, 
in terms of the lengths of these blocks and the sums of the function values for states in these 
blocks. 

4) We see that blocks starting and ending with A and blocks starting and ending with B are 
equally likely, but may have different distributions for their contents. In contrast, blocks that 
start with A and end with B have essentially the same distribution of content as blocks that 
start with B and end with A. 

5) The only difference between the old and new chains is that in the new chain the sampling 
for "homogeneous" blocks (starting and ending in the same state) is stratified — there are 
the same number of blocks starting and ending with A as blocks starting and ending with 
B, whereas the split between these types is random in the old chain (albeit with equal 
probabilities for the two types of homogeneous blocks). 

6) Finally, this stratification will lower (or at least not increase) the asymptotic variance. 
Step 1: Looking at one pair of states is enough 

Whenever $T'(x,y) \ge T(x,y)$ for all $x \ne y$, we can get from $T$ to $T'$ by a series of steps that each 
change transition probabilities for only a single pair of states. For example, consider the following 
steps from $T$ to $T'$, both of which satisfy detailed balance with respect to $\pi = [\,0.4\ \ 0.4\ \ 0.2\,]$: 

$$\begin{bmatrix} 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 0.3 & 0.5 & 0.2 \\ 0.5 & 0.3 & 0.2 \\ 0.4 & 0.4 & 0.2 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 0.3 & 0.5 & 0.2 \\ 0.5 & 0.2 & 0.3 \\ 0.4 & 0.6 & 0.0 \end{bmatrix} \tag{17}$$

Furthermore, if the detailed balance condition (4) holds for $T$ and for $T'$, it will hold also for all 
the intermediate transition probabilities (such as those in the middle matrix above), since the pair 
of transition probabilities for any $x$ and $y$ (with $x \ne y$) at any intermediate point will be either the 
same as for $T$ or the same as for $T'$. 
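
The three matrices of (17) can be checked mechanically. The sketch below is illustrative: states are indexed from 0, the helper names and the test function `f` are hypothetical, and the asymptotic variances are computed with the fundamental-matrix formula as a numerical sanity check rather than as part of the proof.

```python
import numpy as np

pi = np.array([0.4, 0.4, 0.2])
steps = [
    np.array([[0.4, 0.4, 0.2], [0.4, 0.4, 0.2], [0.4, 0.4, 0.2]]),   # T
    np.array([[0.3, 0.5, 0.2], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]),   # intermediate step
    np.array([[0.3, 0.5, 0.2], [0.5, 0.2, 0.3], [0.4, 0.6, 0.0]]),   # T'
]

def detailed_balance(pi, T):
    """Check pi(x) T(x,y) == pi(y) T(y,x) for all x, y."""
    flow = pi[:, None] * T
    return np.allclose(flow, flow.T)

def changed_pairs(Ta, Tb):
    """Unordered pairs {x, y}, x < y, whose off-diagonal probabilities differ between Ta and Tb."""
    n = len(Ta)
    return [(x, y) for x in range(n) for y in range(x + 1, n)
            if not (np.isclose(Ta[x, y], Tb[x, y]) and np.isclose(Ta[y, x], Tb[y, x]))]

def asymptotic_variance(T, pi, f):
    """Asymptotic variance of the sample mean of f for an irreducible chain T with invariant pi."""
    fc = f - pi @ f
    Z = np.linalg.inv(np.eye(len(pi)) - T + np.outer(np.ones(len(pi)), pi))
    return 2.0 * (pi * fc) @ (Z @ fc) - (pi * fc) @ fc

f = np.array([1.0, 0.0, -1.0])                      # illustrative function of state
variances = [asymptotic_variance(T, pi, f) for T in steps]

print([detailed_balance(pi, T) for T in steps])     # every step is reversible
print(changed_pairs(steps[0], steps[1]), changed_pairs(steps[1], steps[2]))
print(variances)                                    # non-increasing, as Peskun's theorem requires
```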

It is therefore enough to prove Peskun's Theorem when $T$ and $T'$ differ for only two states, say 
A and B. The transition probabilities for the old and new chain will then be related as follows: 

$$\begin{aligned} T'(x,y) &= T(x,y), \quad \text{when } x \notin \{A,B\} \text{ or } y \notin \{A,B\} \\ T'(A,A) &= T(A,A) - \delta_A, \qquad T'(A,B) = T(A,B) + \delta_A \\ T'(B,A) &= T(B,A) + \delta_B, \qquad T'(B,B) = T(B,B) - \delta_B \end{aligned} \tag{18}$$

where $\delta_A$ and $\delta_B$ are positive. 



Step 2: Marking "delta" transitions 

Transitions T and T' differ only if the current state is A or B, and then only with respect to how a 
probability mass of 5a or 5b is assigned to new states A or B. We can mark such "delta" transitions 
while simulating the Markov chain. 

The standard way to simulate a Markov chain is as follows: For each state, $x$, partition the
interval $[0,1)$ into intervals $[\ell(x,y), h(x,y))$ such that $h(x,y) - \ell(x,y) = T(x,y)$; to simulate a
transition out of state $x$, generate a random variate, $U$, that is uniformly distributed on $[0,1)$, and
move to the state, $y$, for which $\ell(x,y) \le U < h(x,y)$. We can choose to simulate the old transitions,
$T$, using partitions in which $\ell(A,A) = \ell(B,B) = 0$. With such a choice, we can write the algorithm
for simulating a transition of the old chain in the first form below, in which a slight change
yields the simulation algorithm for the new chain shown after it:

Old chain:

    U ~ Uniform(0, 1)
    if X_t = A and U < δ_A then
        X_{t+1} = A, mark this transition
    else if X_t = B and U < δ_B then
        X_{t+1} = B, mark this transition
    else
        X_{t+1} = y such that U ∈ [ℓ(X_t, y), h(X_t, y))

New chain:

    U ~ Uniform(0, 1)
    if X_t = A and U < δ_A then
        X_{t+1} = B, mark this transition
    else if X_t = B and U < δ_B then
        X_{t+1} = A, mark this transition
    else
        X_{t+1} = y such that U ∈ [ℓ(X_t, y), h(X_t, y))

Clearly, T and T' differ only for the "delta" transitions marked above. 
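A minimal Python sketch of this marked simulation step follows. It is an illustration under stated assumptions rather than the paper's code: states are integers with $A = 0$ and $B = 1$, `T` is the old transition matrix as a list of rows, `delta` maps $A$ and $B$ to $\delta_A$ and $\delta_B$, and the function name `simulate_step` is hypothetical. The partition places the interval for the self-transition first, so that $\ell(x,x) = 0$.

```python
A, B = 0, 1  # assumed integer labels for the two special states


def simulate_step(x, u, T, delta, new_chain):
    """Return (next_state, marked) for current state x and uniform variate u.

    T is the OLD chain's transition matrix. The new chain differs only in
    where a marked ("delta") transition goes.
    """
    if x in delta and u < delta[x]:
        other = B if x == A else A
        return (other if new_chain else x), True
    # Partition [0, 1) with the interval for y = x first, so l(x, x) = 0.
    order = [x] + [y for y in range(len(T)) if y != x]
    low = 0.0
    for y in order:
        high = low + T[x][y]
        if low <= u < high:
            return y, False
        low = high
    return order[-1], False  # guard against rounding when u is close to 1
```

Feeding the same sequence of uniform variates to both versions shows that they agree except on marked transitions, where the old chain stays put and the new chain switches between $A$ and $B$.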






Step 3: Using delta transitions to define blocks 

We can use the markings of delta transitions to divide a simulation of one of these Markov chains
into "blocks" of consecutive states that both start and end with either state $A$ or state $B$. Note
that states $A$ and $B$ may also occur at places other than the start and end of a block. It is possible
for a block to consist of only a single $A$ or a single $B$.

Since asymptotic variance does not depend on the initial state distribution, let's suppose that
$P(X_1 = A) = P(X_1 = B) = 1/2$, so that the chains will begin at the start of a block.

For the old chain, with transitions T, we might see blocks like this: 





    A B | B B B B | B B | B A | A A | A A | A B B A | A A | A


For the new chain, with transitions T', the blocks might look like this: 




    A B | A A B B | A A | B A | B A | B B | A A B A | B B | A



The difference is that in the old chain, the state stays the same when crossing a block boundary,
whereas for the new chain, it changes from $A$ to $B$ or from $B$ to $A$.

We can view the simulation in terms of these blocks, and write the estimates $\hat\mu$ and $\hat\mu'$ in terms
of the lengths of the blocks and the sums of $f$ for states in these blocks. For the old chain,

$$
\hat\mu_n \;\approx\; \sum_{i=1}^{k} H_i \Big/ \sum_{i=1}^{k} L_i \tag{19}
$$

where $H_i$ is the sum of $f(X_t)$ for states $X_t$ in block $i$, $L_i$ is the length of block $i$, and $k$ is the
number of blocks in the $n$ iterations of the chain. The equality is only approximate because there
may be a partial block after block $k$. Estimation in terms of blocks is discussed further in Step 6
of the proof.
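As a sketch of how (19) could be computed from a marked simulation, the following assumes a list `states` and a parallel list `marked`, where `marked[t]` is true when the transition out of `states[t]` is a delta transition (so `states[t]` is the last state of a block). The function names are hypothetical.

```python
def split_into_blocks(states, marked):
    """Split a trajectory into complete blocks plus a trailing partial block."""
    blocks, current = [], []
    for x, is_delta in zip(states, marked):
        current.append(x)
        if is_delta:  # a delta transition out of x closes the block
            blocks.append(current)
            current = []
    return blocks, current


def block_estimate(blocks, f):
    """Ratio of summed block totals H_i to summed block lengths L_i, as in (19)."""
    H = [sum(f(x) for x in block) for block in blocks]
    L = [len(block) for block in blocks]
    return sum(H) / sum(L)
```

The trailing partial block is returned separately, which is the source of the approximation noted after (19).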

Step 4: Probabilities of the four types of blocks and their contents 

Blocks come in four types (AA, BB, AB, BA), based on start and end states. For both the
old and new chains, the probabilities of these types (i.e., their frequencies of occurrence in a long
realization of the chain) satisfy

$$ P(AA) = P(BB) \quad \text{and} \quad P(AB) = P(BA) \tag{20} $$

We can show this using the fact that both $T$ and $T'$ leave $\pi$ invariant. In particular, for the old
chain,

$$ \pi(B) = \pi(A)\,T(A,B) + \pi(B)\,T(B,B) + \sum_{x \notin \{A,B\}} \pi(x)\,T(x,B) \tag{21} $$

while for the new chain, using the relationships in (18),

$$ \pi(B) = \pi(A)\,(T(A,B) + \delta_A) + \pi(B)\,(T(B,B) - \delta_B) + \sum_{x \notin \{A,B\}} \pi(x)\,T(x,B) \tag{22} $$

from which it follows that $\pi(A)\,\delta_A = \pi(B)\,\delta_B$.

This lets us show that for a state, $X_t$, from the old chain (with $t$ being large),

$$ P(X_t \text{ starts block with } A) = P(X_{t-1} = A)\, P(\text{delta transition at } t-1 \mid X_{t-1} = A) = \pi(A)\,\delta_A \tag{23} $$
$$ P(X_t \text{ starts block with } B) = P(X_{t-1} = B)\, P(\text{delta transition at } t-1 \mid X_{t-1} = B) = \pi(B)\,\delta_B \tag{24} $$

and hence $P(X_t \text{ starts block with } A) = P(X_t \text{ starts block with } B)$. In the same way, we see that
$P(X_t \text{ ends block with } A) = P(X_t \text{ ends block with } B)$. It follows that

$$ P(AA) + P(AB) = P(BB) + P(BA) \quad \text{and} \quad P(AA) + P(BA) = P(BB) + P(AB) \tag{25} $$

so $P(AA) = P(BB)$ and $P(AB) = P(BA)$.
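The two algebraic steps used above can be written out explicitly. The following worked equations are a sketch that uses only (21), (22), and (25):

```latex
% Subtracting (21) from (22): the T(A,B), T(B,B) and x \notin \{A,B\} terms cancel.
0 = \pi(A)\,\delta_A - \pi(B)\,\delta_B
\quad\Longrightarrow\quad
\pi(A)\,\delta_A = \pi(B)\,\delta_B .

% Adding the two identities in (25): the P(AB) and P(BA) terms cancel.
2\,P(AA) + P(AB) + P(BA) = 2\,P(BB) + P(BA) + P(AB)
\quad\Longrightarrow\quad
P(AA) = P(BB) .

% Substituting P(AA) = P(BB) back into the first identity of (25):
P(AB) = P(BA) .
```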

Similarly, for a state, $X_t$, from the new chain (with $t$ being large),

$$ P(X_t \text{ starts block with } A) = P(X_{t-1} = B)\, P(\text{delta transition at } t-1 \mid X_{t-1} = B) = \pi(B)\,\delta_B \tag{26} $$
$$ P(X_t \text{ starts block with } B) = P(X_{t-1} = A)\, P(\text{delta transition at } t-1 \mid X_{t-1} = A) = \pi(A)\,\delta_A \tag{27} $$

and hence $P(X_t \text{ starts block with } A) = P(X_t \text{ starts block with } B)$ for the new chain as well, and
similarly $P(X_t \text{ ends block with } A) = P(X_t \text{ ends block with } B)$, from which it again follows that
$P(AA) = P(BB)$ and $P(AB) = P(BA)$.

Although blocks of type AA and blocks of type BB are equally common, the distributions for
their contents, and hence for their length and for the sum of values of $f(x)$ over states in the
block, will generally be different. In contrast, blocks of type AB and blocks of type BA have
the same distribution of content, except that the BA blocks are the reversals of the AB blocks,
which has no effect on the sum of $f(x)$ for states in the block. This equivalence of AB and BA
blocks is a consequence of the chains being reversible, and holds for both the old and new chains.

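To illustrate: The probability of block $AQB$ occurring at some large time $t$ in the old chain is

$$
\begin{aligned}
& P(X_t = A \ \&\ \text{block starts})\; P(X_{t+1} = Q \mid X_t = A)\; P(X_{t+2} = B \ \&\ \text{block ends} \mid X_{t+1} = Q) \\
&\quad = \pi(A)\,\delta_A\,T(A,Q)\,T(Q,B)\,\delta_B \;=\; \delta_A \delta_B\, \pi(A)\,T(A,Q)\,T(Q,B) && (28) \\
&\quad = \delta_A \delta_B\, T(Q,A)\,\pi(Q)\,T(Q,B) \;=\; \delta_A \delta_B\, T(Q,A)\,T(B,Q)\,\pi(B) && (29) \\
&\quad = \pi(B)\,\delta_B\,T(B,Q)\,T(Q,A)\,\delta_A && (30)
\end{aligned}
$$

which is also the probability of block $BQA$ occurring at time $t$. For the new chain, the probability
of block $AQB$ occurring at time $t$ is

$$
\begin{aligned}
& P(X_t = A \ \&\ \text{block starts})\; P(X_{t+1} = Q \mid X_t = A)\; P(X_{t+2} = B \ \&\ \text{block ends} \mid X_{t+1} = Q) \\
&\quad = \pi(B)\,\delta_B\,T(A,Q)\,T(Q,B)\,\delta_B \;=\; \pi(A)\,\delta_A\,T(A,Q)\,T(Q,B)\,\delta_B && (31)
\end{aligned}
$$

which is the same as for the old chain, and the same as for block $BQA$.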
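The equality of (28) and (30) can be checked numerically. The sketch below is only an illustration: it takes the middle matrix of (17) as the old chain, with the assumed labelling $A = 1$, $B = 2$, $Q = 0$ and $\delta_A = 0.1$, $\delta_B = 0.2$ (the changes leading to the last matrix of (17)). The function name `block_probability` is hypothetical.

```python
PI = [0.4, 0.4, 0.2]
T_MID = [[0.3, 0.5, 0.2], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
DELTA = {1: 0.1, 2: 0.2}  # assumed: A = 1 with delta_A, B = 2 with delta_B


def block_probability(path, pi, T, delta):
    """Stationary probability that `path` occurs as a complete block of the
    old chain: pi(start) * delta(start) * product of T over steps * delta(end)."""
    p = pi[path[0]] * delta[path[0]]
    for x, y in zip(path, path[1:]):
        p *= T[x][y]
    return p * delta[path[-1]]


p_AQB = block_probability([1, 0, 2], PI, T_MID, DELTA)
p_BQA = block_probability([2, 0, 1], PI, T_MID, DELTA)
```

Both products evaluate to the same value, as reversibility together with $\pi(A)\,\delta_A = \pi(B)\,\delta_B$ requires.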

Step 5: In the new chain, sampling for homogeneous blocks is stratified

Rather than simulate the chains one state at a time, let's imagine simulating the chain block by
block. To show the relationship between the old and the new chains, I'll show how this simulation
can be done in a coupled fashion.

To do this, we will need the probability that a block is "homogeneous" (that it ends with the
same state it begins with), which is

$$
P(\text{ends with } A \mid \text{starts with } A) \;=\; \frac{P(AA)}{P(AA) + P(AB)} \;=\; \frac{P(BB)}{P(BB) + P(BA)} \;=\; P(\text{ends with } B \mid \text{starts with } B) \tag{32}
$$

Call this probability $h$, and note that it is the same for the old chain and the new chain, since the
transitions within a block, and the marking of its end, are the same for both chains. Note as well
that the distribution of the contents of a block, given its type, is the same for the old chain and
the new chain.

We can now simulate block transitions for the "old" and "new" chains as follows. We'll assume 
H below is sampled the same for both chains, but that the simulation of the contents of blocks is 
not coupled between the old and new chains. 



Old chain:

    H ~ Bernoulli(h)
    if H = 1 then
        if previous block ended with A
            simulate an AA block
        else
            simulate a BB block
    else
        if previous block ended with A
            simulate an AB block
        else
            simulate an AB block, then reverse it

New chain:

    H ~ Bernoulli(h)
    if H = 1 then
        if previous block ended with A
            simulate a BB block
        else
            simulate an AA block
    else
        if previous block ended with A
            simulate an AB block, then reverse it
        else
            simulate an AB block


Comparing the simulations for the old and new chains, we see that they produce the same 
sequence of homogeneous/non-homogeneous blocks. However, for the new chain, the homogeneous 
blocks alternate between AA blocks and BB blocks. This is true both when one homogeneous 
block follows another, and when any number of non-homogeneous blocks intervene. In the old 
chain, the type of homogeneous block changes only when an odd number of non-homogeneous 
blocks intervene. This is illustrated below: 



Old chain:

    A B | B B | B B | B B | B B | B A | A A | A A | A A | A B | B A | A A | A A

New chain:

    [block diagram for the new chain not recoverable from the extracted text]
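The coupled block-type rules can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, block contents are not simulated, and the sequence of homogeneous/non-homogeneous flags in the usage note is read off the old-chain illustration above.

```python
def other(state):
    """Return 'B' for 'A' and 'A' for 'B'."""
    return 'B' if state == 'A' else 'A'


def block_types(homogeneous_flags, first_start, new_chain):
    """Types (e.g. 'AB') of successive blocks, given the shared H sequence."""
    types, start = [], first_start
    for hom in homogeneous_flags:
        end = start if hom else other(start)
        types.append(start + end)
        # Old chain: next block starts where this one ended.
        # New chain: next block starts with the other state.
        start = other(end) if new_chain else end
    return types


flags = [0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
old_types = block_types(flags, 'A', new_chain=False)
new_types = block_types(flags, 'A', new_chain=True)
```

With these flags, `old_types` reproduces the old-chain sequence shown above, while in `new_types` the homogeneous blocks alternate between AA and BB regardless of how many non-homogeneous blocks intervene.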





Because AA blocks alternate with BB blocks in the new chain, sampling within the new chain
is stratified in this respect; that is, the number of AA blocks will be equal to the number of BB
blocks (plus or minus one). We can also see this by noting that in the new chain, every block ending
in $A$ (except the last) is paired with a following block beginning with $B$. Letting $N_{AA}$, $N_{AB}$, $N_{BA}$,
and $N_{BB}$ be the numbers of blocks of each type, it follows that

$$ \bigl|\, (N_{AA} + N_{BA}) - (N_{BB} + N_{BA}) \,\bigr| \;\le\; 1 \tag{33} $$

and hence $|N_{AA} - N_{BB}| \le 1$.

Step 6: Stratification of homogeneous blocks won't increase asymptotic variance

The intuition behind the proof is now complete: Sampling with the new chain is stratified with 
respect to blocks of type AA and BB. Furthermore, this is the only difference between the old and 
new chains, since the sum of the values of $f$ for states in a block has the same distribution for AB
blocks and BA blocks. Since we expect that stratification will not increase asymptotic variance 
(and will typically decrease it), the asymptotic variance for the new chain should be no larger than 
for the old chain. 

To justify this formally, we need two lemmas whose detailed statements, proofs, and applications 
to this proof are found in the Appendix. The first lemma says that the asymptotic variance 
(appropriately defined) of an estimator based on a simulation that continues for some specified 
number of blocks is the same as that of an estimator based on a simulation that continues for some 
specified number of Markov chain transitions. This is true because the Central Limit Theorem 
implies that asymptotically there is very little difference between simulating for a specified number 
of blocks and simulating for the number of transitions equal to the expected number for that many 
blocks. 

Accordingly, we can compare the old and new chains in the context of simulations that continue 
for a specified number of blocks. The second lemma justifies the idea that the stratification of AA
and BB blocks will not increase the asymptotic variance of estimators based on such simulations. 
Stratification is of course well-known to be beneficial (or at least not harmful) in the context of 
independent sampling from two populations. The lemma shows that this continues to be true when, 
as here, the stratification is only partial (the ratio of homogeneous to non-homogeneous blocks is 
not fixed), sampling has a Markov chain aspect rather than being independent, and the estimator 
takes the form of a ratio rather than a linear function of the sampled variables. 

This proof provides some insight into why Peskun's theorem needs the premise that the chains
are reversible with respect to $\pi$, rather than the weaker premise that they leave $\pi$ invariant. This
premise is used at two points in the proof. In Step 1, the reduction to old and new chains that differ
only for transitions involving two states would not be possible for non-reversible chains, since there
would be no guarantee that the intermediate chains linking old and new chains differing for several
pairs of states would leave $\pi$ invariant. This is not relevant to the counterexample in Section 3,
however, since it involves two chains that already differ only with regard to transitions between
two states (state $A$ on the left and $B$ on the right).

The reason the new chain in the counterexample has higher asymptotic variance relates to the
second use of reversibility, in Step 4, where the contents of blocks of type BA are seen to have
essentially the same distribution as the contents of blocks of type AB, apart from a reversal that
does not affect the sum of $f(x)$, which is what matters for the estimates. For the counterexample,
the sum of $f(x)$ for blocks of type AB will always be $+1$, whereas this sum will always be $-1$ for
blocks of type BA. In contrast, the sum of $f(x)$ for blocks of type AA or type BB will always
be 0. Examining Step 5 of the proof, one can see that while the new chain stratifies sampling
for blocks of type AA and BB, the old chain stratifies sampling for blocks of type AB and BA.
For a non-reversible chain, stratifying between types AB and BA may be more important, so it is
possible for the old chain to have lower asymptotic variance than the new chain.

6 Proof that modifying a reversible chain to avoid backtracking
doesn't increase asymptotic variance

We are now in a position to prove the main result of this paper, Theorem 2, stated earlier. I will
address the five claims in the theorem in order. The proof of the final and principal claim (e), that
the modified chain has asymptotic variance at least as small as the original chain, will follow closely
the proof of Peskun's theorem presented in the previous section.

Claim (a): The modified chain, with transition probabilities $\tilde T'$, is irreducible

Let $(a,b)$ and $(c,d)$ be distinct states in $\tilde{\mathcal X}$. We need to show that $(a,b)$ and $(c,d)$ are linked by
transitions with non-zero probability under $\tilde T'$. From the definition of $\tilde{\mathcal X}$, $T(c,d) > 0$. If $b = c$,
then $\tilde T'((a,b),(c,d)) = U'_c(a,d) \ge T(c,d) > 0$. Otherwise, from the irreducibility of $T$, there
exist states $x_1, \ldots, x_k$ in $\mathcal X$ with $T(b,x_1) > 0$, $T(x_k,c) > 0$, and $T(x_i,x_{i+1}) > 0$ for $i = 1, \ldots, k-1$,
and hence $(b,x_1), (x_1,x_2), \ldots, (x_k,c) \in \tilde{\mathcal X}$. Furthermore,

$$
\begin{aligned}
\tilde T'((a,b),(b,x_1)) &= U'_b(a,x_1) \;\ge\; T(b,x_1) \;>\; 0 && (34) \\
\tilde T'((b,x_1),(x_1,x_2)) &= U'_{x_1}(b,x_2) \;\ge\; T(x_1,x_2) \;>\; 0 && (35) \\
&\;\;\vdots \\
\tilde T'((x_k,c),(c,d)) &= U'_c(x_k,d) \;\ge\; T(c,d) \;>\; 0 && (36)
\end{aligned}
$$

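Claim (b): The transition probabilities $\tilde T'$ leave $\tilde\pi$ invariant

This is implied by the way $\tilde T'$ was constructed. It can also be shown directly as follows:

$$
\begin{aligned}
\sum_{(x_0,y_0) \in \tilde{\mathcal X}} \tilde\pi(x_0,y_0)\, \tilde T'((x_0,y_0),(x_1,y_1))
&= \sum_{(x_0,y_0) \in \tilde{\mathcal X}} \pi(x_0)\,T(x_0,y_0)\, \delta(y_0,x_1)\, U'_{x_1}(x_0,y_1) && (37) \\
&= \sum_{x_0 \in \mathcal X} \pi(x_0)\,T(x_0,x_1)\, U'_{x_1}(x_0,y_1) && (38) \\
&= \sum_{x_0 \in \mathcal X} \pi(x_1)\,T(x_1,x_0)\, U'_{x_1}(x_0,y_1) && (39) \\
&= \sum_{x_0 \in \mathcal X} \pi(x_1)\,T(x_1,y_1)\, U'_{x_1}(y_1,x_0) && (40) \\
&= \pi(x_1)\,T(x_1,y_1) \sum_{x_0 \in \mathcal X} U'_{x_1}(y_1,x_0) && (41) \\
&= \pi(x_1)\,T(x_1,y_1) \;=\; \tilde\pi(x_1,y_1) && (42)
\end{aligned}
$$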

Claim (c): If $\mathcal X$ contains at least three elements, $\tilde T'$ is not reversible

Let $a$, $b$, and $c$ be three distinct elements of $\mathcal X$. Since the original chain is irreducible, either
$T(a,b) > 0$ or there exist distinct $x_1, \ldots, x_n$ such that $T(a,x_1) > 0$, $T(x_n,b) > 0$ and $T(x_i,x_{i+1}) > 0$
for $i = 1, \ldots, n-1$. Similarly, either $T(b,c) > 0$ or there exist distinct $y_1, \ldots, y_m$ linking $b$ to $c$. One
way or another, we can find distinct $x, y, z$ such that $T(x,y) > 0$ and $T(y,z) > 0$: if $T(a,b) = 0$,
take three consecutive states from $a, x_1, \ldots, x_n, b$; if $T(b,c) = 0$, take three consecutive states from
$b, y_1, \ldots, y_m, c$; and if $T(a,b) > 0$ and $T(b,c) > 0$, use $a, b, c$. The states $(x,y)$ and $(y,z)$ are
in $\tilde{\mathcal X}$, and have positive probability under $\tilde\pi$. (Note that all states in $\mathcal X$ have positive probability
under $\pi$, since the original chain is irreducible.) It follows that $\tilde T'((x,y),(y,z))$ is positive. However,
$\tilde T'((y,z),(x,y))$ is zero, since $z \ne x$. The modified chain with transition probabilities $\tilde T'$ is therefore
non-reversible.
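Claims (b) and (c) can be illustrated on a very small example. The sketch below is hypothetical throughout: it takes the original chain to be a random walk on a triangle ($T(x,y) = 1/2$ for $x \ne y$, uniform $\pi$) and sets $U'_x(y,z) = 1$ when $z$ is the one state different from both $x$ and $y$, so that the modified chain never backtracks. The form $\tilde T'((x_0,y_0),(x_1,y_1)) = \delta(y_0,x_1)\,U'_{x_1}(x_0,y_1)$ is read off from (37); the paper's own construction of $U'$ is not reproduced here.

```python
from fractions import Fraction
from itertools import product

STATES = [0, 1, 2]


def T(x, y):
    """Original chain: random walk on a triangle (an assumed example)."""
    return Fraction(1, 2) if x != y else Fraction(0)


def pi(x):
    return Fraction(1, 3)


def U_new(x, y, z):
    """Hypothetical U'_x(y, z): having arrived at x from y, never go back to y."""
    return Fraction(1) if z not in (x, y) else Fraction(0)


# Expanded state space: pairs (x, y) with T(x, y) > 0.
PAIRS = [(x, y) for x, y in product(STATES, STATES) if T(x, y) > 0]


def pi_tilde(s):
    return pi(s[0]) * T(s[0], s[1])


def T_tilde_new(s0, s1):
    (x0, y0), (x1, y1) = s0, s1
    return U_new(x1, x0, y1) if y0 == x1 else Fraction(0)


def is_invariant():
    return all(sum(pi_tilde(s) * T_tilde_new(s, s1) for s in PAIRS) == pi_tilde(s1)
               for s1 in PAIRS)


def is_reversible():
    return all(pi_tilde(s) * T_tilde_new(s, s1) == pi_tilde(s1) * T_tilde_new(s1, s)
               for s in PAIRS for s1 in PAIRS)
```

In this example `is_invariant()` holds while `is_reversible()` does not, and the marginal of $\tilde\pi$ over the second component recovers $\pi$.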



Claim (d): The bias of the estimator $\hat\mu'_n$ is of order $1/n$

The modified chain leaves invariant the distribution $\tilde\pi(x,y) = \pi(x)\,T(x,y) = \pi(y)\,T(y,x)$. The
marginal distribution for the second component of state under $\tilde\pi$ is $\sum_x \tilde\pi(x,y) = \pi(y)$. An MCMC
estimator that looks at a function of the second component of state will therefore converge to the
correct expectation of this function with respect to $\pi$, with bias of order $1/n$, in accordance with
the standard properties of MCMC estimators discussed earlier in the paper.



Claim (e): The asymptotic variance of $\hat\mu'$ is no greater than that of $\hat\mu$

This is the principal claim. Its proof will follow the same steps as the proof of Peskun's theorem
in Section 5.



Step 1: Looking at one pair of states is enough 

Earlier in the paper, a chain on $\tilde{\mathcal X}$ was defined that was essentially equivalent to the original chain on
$\mathcal X$, with transitions $T$. The transition probabilities for this chain were defined in equation (11) to
have the form $\tilde T((x_0,y_0),(x_1,y_1)) = \delta(x_1,y_0)\,T(x_1,y_1)$, which can be seen as an instance of the
definition of $\tilde T'$ in equation (12), with $U_{x_1}(x_0,y_1) = T(x_1,y_1)$. We can therefore view Theorem 2
as claiming that changing from this $U$ to some other $U'$ that satisfies conditions (13) and (14)
will not increase asymptotic variance. More generally, we will see that a change from transitions $\tilde T$
based on any $U$ satisfying condition (13) to transitions $\tilde T'$ based on some other $U'$ that also satisfies
this condition and for which

$$ U'_x(y,z) \;\ge\; U_x(y,z), \quad \text{for all } x, y, z \in \mathcal X \text{ with } y \ne z \tag{43} $$

will not increase asymptotic variance.

Any such change from $U$ to $U'$ can be expressed as a sequence of changes, each of which affects
$U_x(y,z)$ only when $x$ is some particular state $O$, and $y$ and $z$ are both either state $A$ or state $B$.
In order for condition (13) to be satisfied, such $U$ and $U'$ must be related as follows:

$$
\begin{aligned}
U'_x(y,z) &= U_x(y,z), \quad \text{when } x \ne O \text{ or } y \notin \{A,B\} \text{ or } z \notin \{A,B\} \\
U'_O(A,A) &= U_O(A,A) - \delta_A, \qquad U'_O(A,B) = U_O(A,B) + \delta_A \\
U'_O(B,A) &= U_O(B,A) + \delta_B, \qquad U'_O(B,B) = U_O(B,B) - \delta_B
\end{aligned}
\tag{44}
$$

The resulting transition probabilities, $\tilde T$ and $\tilde T'$, will be identical except for the following states:

$$
\begin{aligned}
\tilde T'((A,O),(O,A)) &= \tilde T((A,O),(O,A)) - \delta_A, \qquad \tilde T'((A,O),(O,B)) = \tilde T((A,O),(O,B)) + \delta_A \\
\tilde T'((B,O),(O,A)) &= \tilde T((B,O),(O,A)) + \delta_B, \qquad \tilde T'((B,O),(O,B)) = \tilde T((B,O),(O,B)) - \delta_B
\end{aligned}
\tag{45}
$$

The rest of the proof will assume that $\tilde T$ and $\tilde T'$ differ only as above. The chain with transition
probabilities $\tilde T$ will be referred to as the "old" chain, while that with transition probabilities $\tilde T'$
will be called the "new" chain.

Steps 2 & 3: Defining blocks delimited by "delta" transitions

We can mark the transitions in a realization of either the old or the new chain that would have
been different in the other chain (i.e., those transitions where the addition or subtraction of $\delta_A$ or
$\delta_B$ in (45) would have made a difference). This can be done using a simulation procedure entirely
analogous to that described in Step 2 of the proof of Peskun's theorem.

As in Step 3 of that proof, we can use these "delta" transitions to define the boundaries between
"blocks" of states in a realization of the old or new chain. Such blocks will always begin with state
$(O,A)$ or $(O,B)$ and end with state $(A,O)$ or $(B,O)$. If we assume that we start the chain in state
$(O,A)$ or $(O,B)$, a typical sequence of blocks for the old chain might look like

    (O,A) (B,O) | (O,B) (A,O) | (O,A) (A,O) | (O,A) (A,O) | (O,A) (B,O) | (O,B)

whereas a typical block sequence for the new chain might look like

    (O,A) (B,O) | (O,A) (A,O) | (O,B) (B,O) | (O,A) (B,O) | (O,A) (B,O) | (O,A)



Note that states $(O,A)$, $(A,O)$, $(O,B)$, and $(B,O)$ may occur within a block, as well as at the
beginning and end. The difference between the two chains is that states in the old chain on each
side of a block boundary are simply reversals of each other, whereas in the new chain, one of the
states on the two sides of a boundary will contain $A$ and the other $B$.

Step 4: Probabilities of the four types of blocks and their contents 

Blocks starting with (O, A) and ending with (A, O) will be called AA blocks, those starting with 
(O, A) and ending with (B, O) will be called AB blocks, and similarly for blocks of types BB and 
BA. The way these blocks are produced is illustrated in the diagram below: 

[Diagram with four panels labelled "AB block", "BA block", "AA block", and "BB block"; the drawings are not recoverable from the extracted text.]

An AA block starts with state $(O,A)$, represented in the diagram by an arrow from $O$ to $A$.
Transitions from this state lead to an arrow from $A$ to some other state, then arrows between other
states, and eventually to an arrow from some state to $A$, followed by an arrow from $A$ to $O$, which
represents the state $(A,O)$. For the block to end here, a delta transition must occur at this
point, which happens with probability $\delta_A$. An AB block also starts with an arrow from $O$ to $A$,
but ends with an arrow pointing to state $B$, and then an arrow from $B$ to $O$ that is followed by
a delta transition.

To find how the probabilities of the four types of blocks, $P(AA)$, $P(AB)$, $P(BA)$, and $P(BB)$,
are related, we can start by noting that since condition (13) applies to both $U$ and $U'$, which are
related by (44), it follows that

$$
\begin{aligned}
T(O,A)\,\delta_A &= T(O,A)\,\bigl(U'_O(A,B) - U_O(A,B)\bigr) \\
&= T(O,B)\,\bigl(U'_O(B,A) - U_O(B,A)\bigr) \;=\; T(O,B)\,\delta_B
\end{aligned}
\tag{46}
$$

Next, note that the probability that a block beginning with $(O,A)$ starts at time $t$ (with $t$ being
large) in the old chain is

$$ \tilde\pi(A,O)\,\delta_A = \pi(A)\,T(A,O)\,\delta_A = \pi(O)\,T(O,A)\,\delta_A \tag{47} $$

Using (46), we see that this is equal to the probability that a block beginning with $(O,B)$ starts
at time $t$, which is

$$ \tilde\pi(B,O)\,\delta_B = \pi(B)\,T(B,O)\,\delta_B = \pi(O)\,T(O,B)\,\delta_B \tag{48} $$

Similarly, the probability that a block ends with state $(A,O)$ at time $t$,

$$ \tilde\pi(A,O)\,\delta_A = \pi(A)\,T(A,O)\,\delta_A = \pi(O)\,T(O,A)\,\delta_A \tag{49} $$

is equal to the probability that a block ends with state $(B,O)$ at time $t$,

$$ \tilde\pi(B,O)\,\delta_B = \pi(B)\,T(B,O)\,\delta_B = \pi(O)\,T(O,B)\,\delta_B \tag{50} $$

It follows that $P(AA) + P(AB) = P(BB) + P(BA)$ and $P(AA) + P(BA) = P(BB) + P(AB)$ for
the old chain, from which we can conclude that $P(AA) = P(BB)$ and $P(AB) = P(BA)$. A similar
argument shows this for the new chain as well.

The distribution of the contents of blocks of type AA may differ from that for the contents of
blocks of type BB, but blocks of type AB and blocks of type BA can be viewed as reversals of
each other. Consider, for example, a block consisting of the following states:

$$ (O,A),\ (A,X),\ (X,Y),\ (Y,B),\ (B,O) \tag{51} $$

As seen above, the probability of a block starting with $(O,A)$ at some given time is $\pi(A)\,T(A,O)\,\delta_A$
(in both the old and the new chain, as seen from (46) and the reversibility of $T$). Multiplying by the
probabilities of the subsequent transitions, and the probability of a delta transition from $(B,O)$,
the probability of the AB block above occurring at a given time in the old chain is

$$ \pi(A)\,T(A,O)\,\delta_A\; U_A(O,X)\, U_X(A,Y)\, U_Y(X,B)\, U_B(Y,O)\; \delta_B \tag{52} $$

This is also the probability in the new chain, since $U$ and $U'$ are the same for all except delta
transitions. Now consider the reversal of the block in (51):

$$ (O,B),\ (B,Y),\ (Y,X),\ (X,A),\ (A,O) \tag{53} $$

The probability of this block occurring at a given time is

$$ \pi(B)\,T(B,O)\,\delta_B\; U_B(O,Y)\, U_Y(B,X)\, U_X(Y,A)\, U_A(X,O)\; \delta_A \tag{54} $$



We can see that (52) and (54) are equal as follows, using the reversibility of $T$ with respect to $\pi$
and the fact that $U$ satisfies condition (13) (the common factor $\delta_A \delta_B$ is omitted):

$$
\begin{aligned}
& \pi(A)\,T(A,O)\,U_A(O,X)\,U_X(A,Y)\,U_Y(X,B)\,U_B(Y,O) \\
&\quad = \pi(A)\,T(A,X)\,U_A(X,O) \cdot U_X(A,Y)\,U_Y(X,B)\,U_B(Y,O) && (55) \\
&\quad = U_A(X,O) \cdot \pi(X)\,T(X,A)\,U_X(A,Y) \cdot U_Y(X,B)\,U_B(Y,O) && (56) \\
&\quad = U_A(X,O) \cdot \pi(X)\,T(X,Y)\,U_X(Y,A) \cdot U_Y(X,B)\,U_B(Y,O) && (57) \\
&\quad = U_X(Y,A)\,U_A(X,O) \cdot \pi(Y)\,T(Y,X)\,U_Y(X,B) \cdot U_B(Y,O) && (58) \\
&\quad = U_X(Y,A)\,U_A(X,O) \cdot \pi(Y)\,T(Y,B)\,U_Y(B,X) \cdot U_B(Y,O) && (59) \\
&\quad = U_Y(B,X)\,U_X(Y,A)\,U_A(X,O) \cdot \pi(B)\,T(B,Y)\,U_B(Y,O) && (60) \\
&\quad = U_Y(B,X)\,U_X(Y,A)\,U_A(X,O) \cdot \pi(B)\,T(B,O)\,U_B(O,Y) && (61) \\
&\quad = \pi(B)\,T(B,O)\,U_B(O,Y)\,U_Y(B,X)\,U_X(Y,A)\,U_A(X,O) && (62)
\end{aligned}
$$

Step 5: Sampling in the new chain is stratified

As was done in the proof of Peskun's theorem, we can now imagine simulating both the old chain
and the new chain one block at a time. For each block, we decide whether it should be homogeneous
(of type AA or BB) or non-homogeneous (of type AB or BA). The probability of a block being
homogeneous is the same regardless of whether the block starts with $A$ or $B$, since the results in
Step 4 imply that $P(AA) / (P(AA) + P(AB)) = P(BB) / (P(BB) + P(BA))$. The probability
that a block is homogeneous is also the same for the old chain and the new chain. If we make the
same random decisions as to whether or not blocks are homogeneous in the old and new chains,
the sequence of homogeneous versus non-homogeneous blocks will be the same for the two chains.

The only significant difference between the old and new chains is that in the new chain the
sequence of homogeneous blocks alternates between AA blocks and BB blocks, and hence the
number of AA blocks is equal to the number of BB blocks (or differs by only one). This arises
for exactly the same reasons as in the proof of Peskun's theorem: if $N_{AA}$, $N_{AB}$, $N_{BA}$, and $N_{BB}$
are the numbers of blocks of each type, $|(N_{AA} + N_{BA}) - (N_{BB} + N_{BA})| \le 1$ in the new chain,
and hence $|N_{AA} - N_{BB}| \le 1$. The sampling for AA and BB blocks is therefore stratified in the
new chain, but not in the old chain. The old chain stratifies sampling of AB and BA blocks, but
since the distributions for the contents of blocks of types AB and BA are the same (apart from a
reversal, which doesn't affect sums of function values), this stratification in the old chain has no
effect.

Step 6: Stratification will not increase asymptotic variance

Finally, as in the proof of Peskun's theorem, we can apply Lemma 1 in the Appendix to show that 
the asymptotic variance using a simulation that continues for a specified number of blocks is the 
same as that using a simulation for a specified number of transitions. We can then apply Lemma 2 
to show that the block-by-block simulation of the new chain, which is stratified with respect to AA 
and BB blocks, will have asymptotic variance at least as small as for the old chain. 

7 Conclusion 

This paper shows how any reversible Markov chain can be transformed into a non-reversible chain 
that tries to avoid backtracking to the state visited immediately before. This transformation 
never increases the asymptotic variance of an MCMC estimator using the chain, and will usually 
decrease it. Sometimes, the decrease in asymptotic variance is dramatic, but other times it is small. 
In general, one would expect the decrease in asymptotic variance to be small when a state of the 
original Markov chain has many possible successor states (of roughly similar probability), since 
in this situation, even the original chain will rarely backtrack. In many circumstances, the chain 
that avoids backtracking will require little or no more time per transition than the original chain, 
though this cannot be guaranteed in all cases. 

The particular transformation described in this paper may sometimes be of practical use. For 
many problems, however, the gains may be slight or non-existent. In particular, for problems 
with continuous state spaces, and continuous transition distributions, exact backtracking has zero 
probability of occurring anyway. There seems to be scope for generalizing the idea of trying to 
avoid backtracking, however. Possibilities include trying to avoid backtracking to any of the past 
several states, and trying to avoid backtracking not just to the exact previous state, but also to 
anywhere in its vicinity. 

More generally, the results in this paper indicate that non-reversible Markov chains have a 
fundamental advantage over reversible chains, and that the search for better MCMC methods may 
therefore be best focused on non-reversible chains. The proof techniques used in this paper may 
be useful in analysing such methods.

Appendix: Statements and proofs of lemmas

The following lemmas were used in the proofs of Sections 5 and 6.

The first lemma justifies looking at the asymptotic variance of simulations continuing for a
specified number of blocks, instead of a specified number of transitions. To apply it to blocks
defined by "delta" transitions, we can extend the state space of the Markov chain to include an
indicator of whether the current state is the last in a block (essentially moving the decision whether
a transition from state $A$ or $B$ is to be "marked" back to the previous transition into state $A$ or
$B$). With this extension, the set $S$ below can consist of states $A$ or $B$ at the end of a block.

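Lemma 1: Let $X_1, X_2, \ldots$ be an irreducible Markov chain on a finite state space $\mathcal X$, with invariant
distribution $\pi(x)$. Let $S$ be some non-empty subset of $\mathcal X$, and let $f(x)$ be some function of state,
whose expectation with respect to $\pi$ is $\mu$. Define

$$ N(k) = \min \Bigl\{ n : \sum_{t=1}^{n} I_S(X_t) = k \Bigr\} \tag{63} $$

where $I_S$ is the indicator function for $S$. Consider the following two families of estimators:

$$ \hat\mu_n = \frac{1}{n} \sum_{t=1}^{n} f(X_t), \qquad \bar\mu_k = \frac{1}{N(k)} \sum_{t=1}^{N(k)} f(X_t) \tag{64} $$

The asymptotic variances of these estimators are the same:

$$ \lim_{n \to \infty} n\,\mathrm{Var}(\hat\mu_n) \;=\; \lim_{n \to \infty} n\,\mathrm{Var}\bigl(\bar\mu_{\lceil n\pi(S) \rceil}\bigr) \tag{65} $$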
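To make the two families of estimators concrete, here is a minimal sketch that computes $N(k)$ from (63) and an average over the first $n$ states of a given trajectory. The function names and the list-based representation of the chain are assumptions made for illustration; $k \ge 1$ is assumed.

```python
def first_time_with_k_visits(states, S, k):
    """N(k): the smallest n such that X_1..X_n contains exactly k visits to S.

    Returns None if the trajectory is too short. Assumes k >= 1.
    """
    visits = 0
    for n, x in enumerate(states, start=1):
        if x in S:
            visits += 1
        if visits == k:
            return n
    return None


def mean_over_first(states, f, n):
    """Average of f over X_1..X_n."""
    return sum(f(x) for x in states[:n]) / n
```

The fixed-length estimator is `mean_over_first(states, f, n)`, while the block-based one is `mean_over_first(states, f, first_time_with_k_visits(states, S, k))`.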

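Proof: Without loss of generality, suppose $\mu = 0$. Write $\hat\mu_n$ for the average of $f$ over the first $n$
states and $\bar\mu_k$ for the average over the first $N(k)$ states. We will see that as $n$ increases, $n\,\mathrm{Var}(\hat\mu_n)$ and
$n\,\mathrm{Var}(\bar\mu_{\lceil n\pi(S) \rceil})$ both approach $(n + n^{1/2+\epsilon})\,\mathrm{Var}(\hat\mu_{n+n^{1/2+\epsilon}})$, where $\epsilon$ is a positive constant to be set
below. The diagram below may help to visualize the proof:

[Diagram: a number line marking the points $n - n^{1/2+\epsilon}$, $n$, and $n + n^{1/2+\epsilon}$.]

First, we note that $(n + n^{1/2+\epsilon})\,\hat\mu_{n+n^{1/2+\epsilon}} = n\,\hat\mu_n + n^{1/2+\epsilon} Z$, where $Z$ is the average of $f(X_i)$ for
$i$ from $n+1$ to $n + n^{1/2+\epsilon}$. Dividing by $\sqrt{n + n^{1/2+\epsilon}}$, we get

$$ \sqrt{n + n^{1/2+\epsilon}}\;\hat\mu_{n+n^{1/2+\epsilon}} \;=\; \sqrt{n / (n + n^{1/2+\epsilon})}\;\bigl[\sqrt{n}\,\hat\mu_n + n^{\epsilon} Z\bigr] \tag{66} $$

As $n$ increases, the first factor on the right will go to one. By the Central Limit Theorem for
Markov chains, $|Z|$ will be less than $(n^{1/2+\epsilon})^{-1/2+\epsilon} = n^{-1/4+\epsilon^2}$ with probability approaching one
exponentially fast, so if $\epsilon$ is in $(0, (\sqrt{2}-1)/2)$, the term $n^{\epsilon} Z$ will go to zero. It follows that $n\,\mathrm{Var}(\hat\mu_n)$
will approach $(n + n^{1/2+\epsilon})\,\mathrm{Var}(\hat\mu_{n+n^{1/2+\epsilon}})$. (Since $f(x)$ is bounded, an exponentially small probability
of a large value for $|Z|$ cannot affect this limit.)

From the Central Limit Theorem, we can also conclude that $N(\lceil n\pi(S) \rceil)$ will be in the interval
$(n - n^{1/2+\epsilon},\, n + n^{1/2+\epsilon})$ with probability approaching one exponentially fast. If so, we can write

$$ (n + n^{1/2+\epsilon})\,\hat\mu_{n+n^{1/2+\epsilon}} \;=\; N(\lceil n\pi(S) \rceil)\,\bar\mu_{\lceil n\pi(S) \rceil} + \bigl(n + n^{1/2+\epsilon} - N(\lceil n\pi(S) \rceil)\bigr)\,Y \tag{67} $$

where $Y$ is the average of $f(X_i)$ for $i$ from $N(\lceil n\pi(S) \rceil) + 1$ to $n + n^{1/2+\epsilon}$. Dividing by $\sqrt{n + n^{1/2+\epsilon}}$,
we get

$$ \sqrt{n + n^{1/2+\epsilon}}\;\hat\mu_{n+n^{1/2+\epsilon}} \;=\; \sqrt{\frac{N(\lceil n\pi(S) \rceil)}{n + n^{1/2+\epsilon}}}\;\Bigl[\sqrt{N(\lceil n\pi(S) \rceil)}\;\bar\mu_{\lceil n\pi(S) \rceil} + \frac{K\,Y}{\sqrt{N(\lceil n\pi(S) \rceil)}}\Bigr] \tag{68} $$

where $K = n + n^{1/2+\epsilon} - N(\lceil n\pi(S) \rceil)$ will be in $(0,\, 2n^{1/2+\epsilon})$ if $N(\lceil n\pi(S) \rceil)$ is in $(n - n^{1/2+\epsilon},\, n + n^{1/2+\epsilon})$.
By the Central Limit Theorem for Markov chains, $|KY|$ will be less than $(2n^{1/2+\epsilon})^{1/2+\epsilon} =
2^{1/2+\epsilon}\, n^{1/4+\epsilon+\epsilon^2}$ with a probability that approaches one exponentially fast. Since $N(\lceil n\pi(S) \rceil)$ will
approach $n$, we can see that $n\,\mathrm{Var}(\bar\mu_{\lceil n\pi(S) \rceil})$ will approach $(n + n^{1/2+\epsilon})\,\mathrm{Var}(\hat\mu_{n+n^{1/2+\epsilon}})$.

Since $n\,\mathrm{Var}(\hat\mu_n)$ and $n\,\mathrm{Var}(\bar\mu_{\lceil n\pi(S) \rceil})$ both approach the same value as $n$ goes to infinity, they must
also have the same limit, as the lemma states.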
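The restriction on $\epsilon$ comes from simple exponent arithmetic, which can be written out as follows. This is a sketch of the calculation implied by the bounds in the proof, not an additional result:

```latex
% Bound on the remainder term in (66):
n^{\epsilon}\,|Z| \;<\; n^{\epsilon}\,\bigl(n^{1/2+\epsilon}\bigr)^{-1/2+\epsilon}
  \;=\; n^{\epsilon}\, n^{\epsilon^{2} - 1/4}
  \;=\; n^{\epsilon^{2} + \epsilon - 1/4}.

% This tends to zero exactly when the exponent is negative:
\epsilon^{2} + \epsilon - \tfrac14 < 0
  \quad\Longleftrightarrow\quad
  0 < \epsilon < \frac{\sqrt{2} - 1}{2} \approx 0.207 .

% The remainder term in (68) has the same exponent, since N(\lceil n\pi(S)\rceil) is of order n:
\frac{|KY|}{\sqrt{n}} \;<\; \frac{2^{1/2+\epsilon}\, n^{1/4 + \epsilon + \epsilon^{2}}}{n^{1/2}}
  \;=\; 2^{1/2+\epsilon}\, n^{\epsilon^{2} + \epsilon - 1/4}.
```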

The second lemma justifies the claim that partial stratification of sampling for blocks cannot
increase asymptotic variance. In applying this lemma to the proofs in Sections 5 and 6, $Z_1, Z_2, \ldots$
are identifiers for the type of each block (we can use 0 = AA, 1 = BB, and 2 = AB or BA), which
form a Markov chain, since the distribution for the type of the next block depends only on the
type of the previous block. The types of blocks for the modified chain are $Z'_1, Z'_2, \ldots$, which are
stratified with respect to 0 and 1. In the applications of this lemma, $H$ corresponds to the sum of
the values of $f$ for all states in a block, and $L$ corresponds to the number of states in this block.

Lemma 2: Let $Z_1, Z_2, \ldots$ be an irreducible Markov chain with state space $\{0, 1, 2\}$, whose invariant
distribution, $\rho$, satisfies $\rho(0) = \rho(1)$. Let $Q_z$ for $z = 0, 1, 2$ be distributions for pairs $(H, L) \in \mathbb{R} \times \mathbb{R}^+$
having finite second moments. Conditional on $Z_1, Z_2, \ldots$, let $(H_i, L_i)$ be drawn independently from
$Q_{Z_i}$. Define

$$
Z'_i \;=\;
\begin{cases}
Z_i & \text{if } Z_i = 2 \\[2pt]
Z_k + \sum_{j=1}^{i-1} I_{\{0,1\}}(Z_j) \pmod 2 & \text{if } Z_i \ne 2
\end{cases}
\tag{69}
$$

where $k = \min\{i : Z_i \ne 2\}$. (In other words, the $Z'_i$ are the same as the $Z_i$ except that the positions
where 0 or 1 occurs have their values changed to a sequence of alternating 0s and 1s.) Conditional
on $Z_1, Z_2, \ldots$, let $(H'_i, L'_i)$ be drawn independently from $Q_{Z'_i}$. Define two families of estimators as
follows:

$$ R_n = \sum_{i=1}^{n} H_i \Big/ \sum_{i=1}^{n} L_i, \qquad R'_n = \sum_{i=1}^{n} H'_i \Big/ \sum_{i=1}^{n} L'_i \tag{70} $$

Then the asymptotic variance of $R'$ is no greater than that of $R$. In other words,

$$ \lim_{n \to \infty} n\,\mathrm{Var}(R'_n) \;\le\; \lim_{n \to \infty} n\,\mathrm{Var}(R_n) \tag{71} $$
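The transformation in (69) is easy to state in code. The sketch below uses hypothetical function names and a plain list of block-type identifiers; it builds $Z'$ from $Z$ and computes the ratio estimator in (70) from a list of $(H, L)$ pairs.

```python
def stratify(z):
    """Z' from (69): keep every 2; replace the 0/1 entries by an alternating
    sequence that starts with the first 0/1 value found in z."""
    out = []
    seen = 0        # number of earlier positions j with Z_j in {0, 1}
    first = None    # Z_k, the first entry that is not 2
    for zi in z:
        if zi == 2:
            out.append(2)
        else:
            if first is None:
                first = zi
            out.append((first + seen) % 2)
            seen += 1
    return out


def ratio_estimate(pairs):
    """R_n from (70): sum of H_i divided by sum of L_i."""
    return sum(h for h, _ in pairs) / sum(l for _, l in pairs)
```

For example, `stratify([2, 1, 1, 0, 2, 1, 0, 0])` leaves the 2s in place and replaces the remaining entries by 1, 0, 1, 0, 1, 0, so the counts of 0s and 1s differ by at most one.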

Proof: Let $N_{n,m} = (1/n) \sum_{i=1}^{n} I_{\{m\}}(Z_i)$ and $N'_{n,m} = (1/n) \sum_{i=1}^{n} I_{\{m\}}(Z'_i)$. Note that $E(N_{n,m}) =
E(N'_{n,m})$ and $|N'_{n,0} - N'_{n,1}| \le 1/n$, so the proportions of pairs from $Q_0$ and $Q_1$ are stratified in $R'_n$. By
the Central Limit Theorem for Markov chains, $N_n = (N_{n,0}, N_{n,1}, N_{n,2})$ and $N'_n = (N'_{n,0}, N'_{n,1}, N'_{n,2})$
asymptotically have (degenerate) multivariate normal distributions, with the same mean vectors,
though different covariance matrices. (Note that although $Z'_1, Z'_2, \ldots$ is not a Markov chain, a
Markov chain can be defined on an extended state space that includes $Z'_i$ as a component.)

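The asymptotic variances of $R_n$ and $R'_n$ can be decomposed as follows:

$$
n\mathrm{Var}(R_n) \;=\; n\mathrm{Var}(E(R_n \mid N_n)) \;+\; E(n\mathrm{Var}(R_n \mid N_n)) \qquad (72)
$$
$$
n\mathrm{Var}(R'_n) \;=\; n\mathrm{Var}(E(R'_n \mid N'_n)) \;+\; E(n\mathrm{Var}(R'_n \mid N'_n)) \qquad (73)
$$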

Note that the distribution of $R_n$ given $N_n = N$ is the same as the distribution of $R'_n$ given $N'_n = N$.
Writing $R_n = (1/n) \sum H_i \,/\, (1/n) \sum L_i$, we can apply the Central Limit Theorem to the numerator
and denominator, then use the delta rule to conclude that $R_n$ given $N_n$ is asymptotically normal,
with asymptotic variance that depends only on $N_n$ (not on $n$). Since $N_n$ and $N'_n$ have the same
means and both are asymptotically normal, we can apply the delta rule again to conclude that the
second term on the right in (72) is equal to the second term on the right in (73).
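
The delta-rule step can be written out as follows. This is a sketch under notation that does not appear in the original: $h_m$ and $\ell_m$ denote the means of $H$ and $L$ under $Q_m$, $\mathrm{Var}_{Q_m}$ denotes variance under $Q_m$, and $N = (N_0, N_1, N_2)$ is a fixed value of $N_n$. Given $N_n = N$, the sums in $R_n$ consist of $nN_m$ independent draws from each $Q_m$, which gives

```latex
\begin{align*}
E(R_n \mid N_n = N) &\;\approx\; r(N) \;=\;
   \frac{\sum_{m=0}^{2} N_m\, h_m}{\sum_{m=0}^{2} N_m\, \ell_m}, \\[6pt]
n\,\mathrm{Var}(R_n \mid N_n = N) &\;\approx\;
   \frac{\sum_{m=0}^{2} N_m\, \mathrm{Var}_{Q_m}\!\bigl(H - r(N)\,L\bigr)}
        {\bigl(\sum_{m=0}^{2} N_m\, \ell_m\bigr)^{2}} .
\end{align*}
```

Both expressions depend on the type sequence only through $N$, which is why the same formulas apply to $R'_n$ given $N'_n = N$.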

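Looking at the first terms in (72) and (73), we can rewrite $n\mathrm{Var}(E(R_n \mid N_n))$ and $n\mathrm{Var}(E(R'_n \mid N'_n))$
as follows:

$$
n\mathrm{Var}(E(R_n \mid N_n)) \;=\; n\mathrm{Var}(E(E(R_n \mid N_n) \mid N_{n,2})) \;+\; E(n\mathrm{Var}(E(R_n \mid N_n) \mid N_{n,2})) \qquad (74)
$$
$$
n\mathrm{Var}(E(R'_n \mid N'_n)) \;=\; n\mathrm{Var}(E(E(R'_n \mid N'_n) \mid N'_{n,2})) \;+\; E(n\mathrm{Var}(E(R'_n \mid N'_n) \mid N'_{n,2})) \qquad (75)
$$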

Since the expectations of $N_{n,0}$ and $N_{n,1}$ given $N_{n,2} = \nu$ are the same as the expectations of $N'_{n,0}$
and $N'_{n,1}$ given $N'_{n,2} = \nu$, we can conclude that $E(E(R_n \mid N_n) \mid N_{n,2} = \nu)$ is asymptotically equal
to $E(E(R'_n \mid N'_n) \mid N'_{n,2} = \nu)$. The distributions of $N_{n,2}$ and $N'_{n,2}$ are the same, so it follows that the
first terms on the right in (74) and (75) are asymptotically equal. Due to stratification, $N'_{n,0}$ and
$N'_{n,1}$ are fixed given $N'_{n,2}$, so that $\mathrm{Var}(E(R'_n \mid N'_n) \mid N'_{n,2}) = 0$. It follows that the second terms on
the right in (74) and (75) are related by

$$
E(n\mathrm{Var}(E(R_n \mid N_n) \mid N_{n,2})) \;\ge\; E(n\mathrm{Var}(E(R'_n \mid N'_n) \mid N'_{n,2})) \;=\; 0 \qquad (76)
$$

Combining these results, we can conclude that asymptotically $\mathrm{Var}(E(R_n \mid N_n)) \ge \mathrm{Var}(E(R'_n \mid N'_n))$,
and finally, that $\mathrm{Var}(R_n)$ is asymptotically at least as large as $\mathrm{Var}(R'_n)$.
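
The way the pieces combine can be laid out in one chain. This is only a summary sketch of the steps above, and it assumes that all the limits involved exist:

```latex
\begin{align*}
\lim_{n\to\infty} \bigl[\, n\mathrm{Var}(R_n) - n\mathrm{Var}(R'_n) \,\bigr]
 &= \lim_{n\to\infty} \bigl[\, n\mathrm{Var}(E(R_n \mid N_n)) - n\mathrm{Var}(E(R'_n \mid N'_n)) \,\bigr]
    && \text{second terms of (72), (73) agree} \\
 &= \lim_{n\to\infty} \bigl[\, E(n\mathrm{Var}(E(R_n \mid N_n) \mid N_{n,2}))
      - E(n\mathrm{Var}(E(R'_n \mid N'_n) \mid N'_{n,2})) \,\bigr]
    && \text{first terms of (74), (75) agree} \\
 &= \lim_{n\to\infty} E(n\mathrm{Var}(E(R_n \mid N_n) \mid N_{n,2})) \;\ge\; 0
    && \text{by (76)}
\end{align*}
```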